1 Introduction

This template is meant to provide a starting point for exploratory data analysis (EDA). It also showcases several packages Lizzie has built. To an extent, you can plug in your own data and use this template as-is, but every data project is unique, so I recommend treating it as a guide rather than a plug-and-chug template.

1.1 Data Summary

I like to begin my EDA by using the summarytools package (its dfSummary function) to give an overall summary of the data. I create a mock data frame here, but you can read in your own.

#missvar has a missing value in its first and last rows
mydata <- data.frame(pred1 = rep(1:4,10),pred2=rep(1:5,8),missvar=as.numeric(c(NA,rep(1:2,19),NA)))

mydata$groupvar1<-c(rep("A",20),rep("B",20))
mydata$groupvar2<-c(rep("A",10),rep("B",20),rep("C",10))
mydata$weight<-c(rep(1,30),rep(0,10))

library(summarytools)

dfSummary(mydata, plain.ascii = FALSE, style = "grid", graph.magnif = .75, valid.col=FALSE, tmp.img.dir = "/tmp")

1.1.1 Data Frame Summary

mydata
Dimensions: 40 x 6
Duplicates: 0

No | Variable | Stats / Values | Freqs (% of Valid) | Missing
1 | pred1 [integer] | Mean (sd): 2.5 (1.1); min < med < max: 1 < 2.5 < 4; IQR (CV): 1.5 (0.5) | 1: 10 (25.0%); 2: 10 (25.0%); 3: 10 (25.0%); 4: 10 (25.0%) | 0 (0.0%)
2 | pred2 [integer] | Mean (sd): 3 (1.4); min < med < max: 1 < 3 < 5; IQR (CV): 2 (0.5) | 1: 8 (20.0%); 2: 8 (20.0%); 3: 8 (20.0%); 4: 8 (20.0%); 5: 8 (20.0%) | 0 (0.0%)
3 | missvar [numeric] | Min: 1; Mean: 1.5; Max: 2 | 1: 19 (50.0%); 2: 19 (50.0%) | 2 (5.0%)
4 | groupvar1 [character] | 1. A; 2. B | 20 (50.0%); 20 (50.0%) | 0 (0.0%)
5 | groupvar2 [character] | 1. A; 2. B; 3. C | 10 (25.0%); 20 (50.0%); 10 (25.0%) | 0 (0.0%)
6 | weight [numeric] | Min: 0; Mean: 0.8; Max: 1 | 0: 10 (25.0%); 1: 30 (75.0%) | 0 (0.0%)
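If you want to cross-check the headline numbers in the summary without any extra packages, a minimal base R sketch might look like the following. It rebuilds the mock data so it runs on its own; the variable names `dims`, `n_duplicates`, `na_counts`, and `na_percent` are illustrative and not part of summarytools.

```r
# Recreate the mock data from above so this block runs on its own.
mydata <- data.frame(
  pred1 = rep(1:4, 10),
  pred2 = rep(1:5, 8),
  missvar = as.numeric(c(NA, rep(1:2, 19), NA))
)
mydata$groupvar1 <- c(rep("A", 20), rep("B", 20))
mydata$groupvar2 <- c(rep("A", 10), rep("B", 20), rep("C", 10))
mydata$weight <- c(rep(1, 30), rep(0, 10))

dims <- dim(mydata)                      # number of rows and columns
n_duplicates <- sum(duplicated(mydata))  # count of fully duplicated rows
na_counts <- colSums(is.na(mydata))      # missing values per column
na_percent <- round(100 * na_counts / nrow(mydata), 1)

print(dims)
print(n_duplicates)
print(data.frame(missing = na_counts, percent = na_percent))
```

These values should agree with the Dimensions, Duplicates, and Missing entries reported by dfSummary.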

1.2 Methodology

Normally, this is where I’d describe my methodology, but since this is a template, I’ll instead use this section to show how to display data in a table format.

Most datasets I work with are too big to be displayed as a table, but for the purposes of showing how to do this, you can view the full dataset this way:

knitr::kable(mydata)
pred1 pred2 missvar groupvar1 groupvar2 weight
1 1 NA A A 1
2 2 1 A A 1
3 3 2 A A 1
4 4 1 A A 1
1 5 2 A A 1
2 1 1 A A 1
3 2 2 A A 1
4 3 1 A A 1
1 4 2 A A 1
2 5 1 A A 1
3 1 2 A B 1
4 2 1 A B 1
1 3 2 A B 1
2 4 1 A B 1
3 5 2 A B 1
4 1 1 A B 1
1 2 2 A B 1
2 3 1 A B 1
3 4 2 A B 1
4 5 1 A B 1
1 1 2 B B 1
2 2 1 B B 1
3 3 2 B B 1
4 4 1 B B 1
1 5 2 B B 1
2 1 1 B B 1
3 2 2 B B 1
4 3 1 B B 1
1 4 2 B B 1
2 5 1 B B 1
3 1 2 B C 0
4 2 1 B C 0
1 3 2 B C 0
2 4 1 B C 0
3 5 2 B C 0
4 1 1 B C 0
1 2 2 B C 0
2 3 1 B C 0
3 4 2 B C 0
4 5 NA B C 0
#you can align with "c" for center and "r" for right. To align multiple columns, repeat the letter once per column, for example "rrr" to right-align three columns

knitr::kable(as.data.frame(table(mydata$pred1)),align = "l")
Var1 Freq
1 10
2 10
3 10
4 10
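As a small illustration of the alignment comment above, the sketch below builds the same frequency table and aligns its two columns in two different ways. It assumes the knitr package is installed; `format = "pipe"` is set only so the alignment shows up when the table is printed to the console, and the object names are illustrative.

```r
library(knitr)

# Frequency table for pred1, as in the example above.
pred_counts <- as.data.frame(table(pred1 = rep(1:4, 10)))

# One letter per column: both columns right-aligned.
right_aligned <- kable(pred_counts, format = "pipe", align = "rr")

# A character vector works too: first column left, second centered.
mixed <- kable(pred_counts, format = "pipe", align = c("l", "c"))

print(right_aligned)
print(mixed)
```

In the pipe format, the colons in the separator row under the header indicate each column's alignment.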

2 Initial Analyses

2.1 Factor Visualizations

First, we create one-way plots for our factor variables. I show some functions I like to use to make this simple. I also show the CatCorr package, available on my GitHub, which helps identify when factor variables are strongly associated.

#you can replace this list with factor variables in your data if you wish
facvars<-c("groupvar1","groupvar2","weight")

factor.df<- mydata[,which(colnames(mydata) %in% facvars)]

library(ggplot2)

factorplot<-function(i){
  ggplot(factor.df, aes(x=factor.df[,i],weight=weight))+
    #after_stat() replaces the deprecated ..count.. notation
    geom_bar(aes(y=after_stat(count/sum(count))))+
    xlab(colnames(factor.df)[i])+
    #uncomment if xlabels should be turned sideways if many categories
    #theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
    ylab("Proportion")
}

factorplot(1)

factorplot(2)
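The bars in these plots are weighted proportions: each category's share of the total weight. A minimal base R sketch of the same calculation, useful for checking the plots numerically, is shown below. The helper name `weighted_share` is hypothetical and the mock columns are rebuilt so the block runs on its own.

```r
# Rebuild the grouping and weight columns from the mock data.
mydata <- data.frame(
  groupvar1 = c(rep("A", 20), rep("B", 20)),
  groupvar2 = c(rep("A", 10), rep("B", 20), rep("C", 10)),
  weight = c(rep(1, 30), rep(0, 10))
)

# Hypothetical helper: share of total weight that falls in each category.
weighted_share <- function(data, group, weight) {
  totals <- tapply(data[[weight]], data[[group]], sum)
  totals / sum(totals)
}

share1 <- weighted_share(mydata, "groupvar1", "weight")
share2 <- weighted_share(mydata, "groupvar2", "weight")

print(share1)
print(share2)
```

Because every row in category C of groupvar2 has weight 0, its weighted share is 0 even though it makes up a quarter of the rows.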

Now, we use a package I built that harnesses Cramér’s V as a measure of association between factor variables, playing the role that correlation plays for numeric variables.

#install.packages("devtools")
#devtools::install_github("easoneli176/CatCorr")

library(CatCorr)
catcorrplot(factor.df)
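For readers who want to see what Cramér's V measures, here is a minimal sketch of the standard formula, V = sqrt(chi-squared / (n * (min(rows, columns) - 1))), computed with base R's chisq.test. This is not the CatCorr implementation, which may differ in its details (for example, in how it handles weights or corrections); the helper name `cramers_v` is hypothetical.

```r
# Rebuild the two grouping variables from the mock data.
groupvar1 <- c(rep("A", 20), rep("B", 20))
groupvar2 <- c(rep("A", 10), rep("B", 20), rep("C", 10))

# Hypothetical helper: uncorrected Cramer's V for two categorical vectors.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-squared statistic without continuity correction.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

v_between <- cramers_v(groupvar1, groupvar2)
v_self <- cramers_v(groupvar1, groupvar1)

print(v_between)
print(v_self)
```

V ranges from 0 (no association) to 1 (one variable fully determines the other), so a variable compared with itself gives 1.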

2.2 Numeric Visualizations

Recall that the data summary already gives one-way distributions for the numeric variables, so this section is only needed if you want to apply weights or prefer a different type of visualization for numeric distributions.

#replace these with a list of your numeric variables & weighting variable if needed
numvars<-c("pred1","pred2","weight")

num.df<-mydata[,which(colnames(mydata) %in% numvars)]

num.df$dummy<-rep(1,dim(num.df)[1])

#if you need to zoom in due to outliers, append: + coord_cartesian(ylim = c(0,1))
ggplot(num.df, aes(x=as.factor(dummy),y=pred1,weight=weight))+geom_violin()+xlab("Distribution")

ggplot(num.df, aes(x=as.factor(dummy),y=pred2,weight=weight))+geom_violin()+xlab("Distribution")
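To see how much the weights change a numeric distribution, it can help to compare weighted and unweighted means alongside the plots. The sketch below uses base R's weighted.mean on the same mock columns; the result names are illustrative.

```r
# Rebuild the numeric columns and weights from the mock data.
num_df <- data.frame(
  pred1 = rep(1:4, 10),
  pred2 = rep(1:5, 8),
  weight = c(rep(1, 30), rep(0, 10))
)

# Rows with weight 0 drop out of the weighted mean entirely.
unweighted_means <- c(pred1 = mean(num_df$pred1), pred2 = mean(num_df$pred2))
weighted_means <- c(
  pred1 = weighted.mean(num_df$pred1, num_df$weight),
  pred2 = weighted.mean(num_df$pred2, num_df$weight)
)

print(rbind(unweighted = unweighted_means, weighted = weighted_means))
```

A noticeable gap between the two rows is a sign that the weighted plots are worth making.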

Now, we evaluate the correlations:

#Note that columns with missing values or constant (zero-variance) columns will break this code
num.df<-num.df[,!(colnames(num.df) %in% c("dummy"))]

corm<-cor(num.df)

library(corrplot)

corrplot(corm, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
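If your own data has missing values or constant columns, one possible way to guard the correlation step is sketched below: drop zero-variance columns and compute correlations on pairwise complete observations. This is an assumption about how you might want to handle such columns, not part of the original workflow; `num_example`, `keep`, and `corm_safe` are illustrative names.

```r
# Example data with a column containing NAs and a constant column.
num_example <- data.frame(
  pred1 = rep(1:4, 10),
  pred2 = rep(1:5, 8),
  missvar = as.numeric(c(NA, rep(1:2, 19), NA)),
  constant = rep(1, 40)
)

# Keep only columns whose standard deviation is defined and positive.
col_sd <- sapply(num_example, sd, na.rm = TRUE)
keep <- !is.na(col_sd) & col_sd > 0

# Use all rows where both variables in a pair are observed.
corm_safe <- cor(num_example[, keep, drop = FALSE], use = "pairwise.complete.obs")

print(names(num_example)[!keep])  # columns that were dropped
print(round(corm_safe, 2))
```

The resulting matrix can be passed to corrplot in the same way as `corm`. Pairwise deletion means each correlation may be based on a different number of rows, which is worth keeping in mind when many values are missing.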

3 Clean Data

3.1 Data Cleaning

Data cleaning can involve a myriad of steps. Here, we highlight a package I created that can make imputing missing values easier.

#install.packages("devtools")
#devtools::install_github("easoneli176/Numpute")

library(Numpute)

numpute_example<-numpute(mydata,"missvar","groupedmean",facvar=c("groupvar1","groupvar2"))

knitr::kable(numpute_example,align="l")
missvar_imputed missvar_missingind
1.444444 1
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.000000 0
2.000000 0
1.555556 1
#append to dataset:
mydata<-mydata[,-which(colnames(mydata) %in% c("missvar"))]

mydata<-as.data.frame(cbind(mydata,numpute_example))
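To make the imputed values above less of a black box, the following base R sketch computes a grouped-mean imputation by hand: each missing value is replaced by the mean of the non-missing values in its groupvar1 by groupvar2 cell. It is an approximation for illustration only and not Numpute's actual code; the names `group_key`, `group_means`, and `imputed_example` are hypothetical.

```r
# Recreate the relevant columns of the original mock data.
mydata_raw <- data.frame(
  missvar = as.numeric(c(NA, rep(1:2, 19), NA)),
  groupvar1 = c(rep("A", 20), rep("B", 20)),
  groupvar2 = c(rep("A", 10), rep("B", 20), rep("C", 10))
)

# One key per combination of the two grouping variables.
group_key <- paste(mydata_raw$groupvar1, mydata_raw$groupvar2, sep = "_")

# Mean of the non-missing values within each group.
group_means <- tapply(mydata_raw$missvar, group_key, mean, na.rm = TRUE)

# Flag the rows that were missing, then fill them with their group mean.
is_missing <- is.na(mydata_raw$missvar)
imputed_example <- data.frame(
  missvar_imputed = ifelse(is_missing,
                           unname(group_means[group_key]),
                           mydata_raw$missvar),
  missvar_missingind = as.integer(is_missing)
)

print(imputed_example[c(1, 2, 39, 40), ])
```

On the mock data, this gives the same imputed values for the first and last rows as the table above.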

3.2 Feature Engineering

Feature engineering is an art form. Here, we showcase a package I built that can help to find groups of correlated variables in large datasets:

#create variable to be correlated:
num.df$corr<-num.df$pred1*2+rbinom(dim(num.df)[1],1,prob=.3)*.01
#install.packages("devtools")
#devtools::install_github("easoneli176/CorrClust")
library(CorrClust)

cc<-CorrClust(num.df,.8)

library(knitr)
library(kableExtra)

cc %>%
  kable(format = "html", col.names = colnames(cc)) %>%
  kable_styling() %>%
  kableExtra::scroll_box(width="100%", height = "300px")
Cluster
pred1 1
pred2 2
weight 3
corr 1
cc$Pred<-rownames(cc)

clusters<-unique(cc[duplicated(cc$Cluster),]$Cluster)

We see that one cluster contains more than one variable: pred1 and the corr variable we created for this demo. The code below finds the variables in that cluster:

cc2<-cc[cc$Cluster == clusters[1],]

cc2$Pred

[1] "pred1" "corr"
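One common way to group correlated variables is hierarchical clustering on a distance of 1 minus the absolute correlation, cutting the tree at 1 minus the threshold. The sketch below illustrates that general idea with base R; it is an assumption about the technique and not necessarily how CorrClust works internally. The seed is set only to make the random noise reproducible.

```r
set.seed(1)

# Rebuild the numeric data, including the deliberately correlated variable.
num_df <- data.frame(
  pred1 = rep(1:4, 10),
  pred2 = rep(1:5, 8),
  weight = c(rep(1, 30), rep(0, 10))
)
num_df$corr <- num_df$pred1 * 2 + rbinom(nrow(num_df), 1, prob = 0.3) * 0.01

threshold <- 0.8

# Variables with |correlation| near 1 get a distance near 0.
distance <- as.dist(1 - abs(cor(num_df)))
tree <- hclust(distance, method = "complete")

# With complete linkage, every pair inside a cluster has |cor| >= threshold.
clusters <- cutree(tree, h = 1 - threshold)

# Variables that share a cluster with at least one other variable.
shared <- names(clusters)[clusters %in% clusters[duplicated(clusters)]]

print(clusters)
print(shared)
```

Variables listed in `shared` are candidates for dropping or combining, since they carry largely the same information.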

3.3 Updated Data Summary

Once the data has been cleaned, it’s nice to have a new summary of the updated data:

dfSummary(mydata, plain.ascii = FALSE, style = "grid", graph.magnif = .75, valid.col=FALSE, tmp.img.dir = "/tmp")

3.3.1 Data Frame Summary

mydata
Dimensions: 40 x 7
Duplicates: 0

No | Variable | Stats / Values | Freqs (% of Valid) | Missing
1 | pred1 [integer] | Mean (sd): 2.5 (1.1); min < med < max: 1 < 2.5 < 4; IQR (CV): 1.5 (0.5) | 1: 10 (25.0%); 2: 10 (25.0%); 3: 10 (25.0%); 4: 10 (25.0%) | 0 (0.0%)
2 | pred2 [integer] | Mean (sd): 3 (1.4); min < med < max: 1 < 3 < 5; IQR (CV): 2 (0.5) | 1: 8 (20.0%); 2: 8 (20.0%); 3: 8 (20.0%); 4: 8 (20.0%); 5: 8 (20.0%) | 0 (0.0%)
3 | groupvar1 [character] | 1. A; 2. B | 20 (50.0%); 20 (50.0%) | 0 (0.0%)
4 | groupvar2 [character] | 1. A; 2. B; 3. C | 10 (25.0%); 20 (50.0%); 10 (25.0%) | 0 (0.0%)
5 | weight [numeric] | Min: 0; Mean: 0.8; Max: 1 | 0: 10 (25.0%); 1: 30 (75.0%) | 0 (0.0%)
6 | missvar_imputed [numeric] | Mean (sd): 1.5 (0.5); min < med < max: 1 < 1.5 < 2; IQR (CV): 1 (0.3) | 1.00: 19 (47.5%); 1.44!: 1 (2.5%); 1.56!: 1 (2.5%); 2.00: 19 (47.5%); ! rounded | 0 (0.0%)
7 | missvar_missingind [numeric] | Min: 0; Mean: 0; Max: 1 | 0: 38 (95.0%); 1: 2 (5.0%) | 0 (0.0%)

3.4 Two Way Factor Plots

Once the data has been cleaned properly, we can investigate relationships between predictors and the target variable. This may be a step you do during cleaning as well. Recall this template is only a guideline.

# withingroup_plot<-function(data, cat, variable, label){
#   ggplot(data, aes(x=as.factor(cat),fill=as.factor(variable)))+
#     labs(fill=label)+
#     geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge" ) +
#     geom_text(aes(y=..count../tapply(..count..,..x..,sum)[..x..],label=scales::percent(..count../tapply(..count..,..x..,sum)[..x..])), stat="count",position=position_dodge(.9),vjust=-.5)+
#     xlab(label)+
#     ylab("Percent")+
#     scale_y_continuous(labels=scales::percent)+theme(axis.text.x = element_text(angle=90,vjust=.5,hjust=1))
# }
# 
# withingroup_plot(mydata,mydata$groupvar2, mydata$groupvar1, 'groupvar1')
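The within-group percentages that a plot like this displays can also be computed directly as a table. A minimal base R sketch, using prop.table with row margins so that each groupvar2 category sums to 1, is shown below; the object names are illustrative.

```r
# Rebuild the two grouping variables from the mock data.
mydata <- data.frame(
  groupvar1 = c(rep("A", 20), rep("B", 20)),
  groupvar2 = c(rep("A", 10), rep("B", 20), rep("C", 10))
)

# Cross-tabulate, with groupvar2 as the rows (the x-axis categories).
counts <- table(groupvar2 = mydata$groupvar2, groupvar1 = mydata$groupvar1)

# margin = 1 divides each row by its own total.
within_group <- prop.table(counts, margin = 1)

print(counts)
print(round(within_group, 2))
```

Each row of `within_group` shows how the rows in one groupvar2 category split across groupvar1.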

3.5 Two Way Numeric Plots

If you want content to show up below your tabs regardless of which tab you’re on, use this code.

4 Appendix

Here, we provide some fun tricks to make your analysis more visually appealing.

Use this code to write in colored font.

Use this code to write in bolded font.

Use this code to do both at once.
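For HTML output, colored or bold text is typically produced with an inline HTML span and a style attribute. The sketch below is a small, hypothetical R helper (`color_text` is not from any package) that builds such a span as a string; it assumes the document is knit to HTML, where the string can be emitted from inline R code.

```r
# Hypothetical helper: wrap text in an HTML span with a color and optional bold.
color_text <- function(text, color = "#1b9e77", bold = FALSE) {
  style <- paste0("color: ", color, ";", if (bold) " font-weight: bold;" else "")
  sprintf('<span style="%s">%s</span>', style, text)
}

colored <- color_text("colored font", color = "red")
both <- color_text("colored and bold", color = "#7570b3", bold = TRUE)

print(colored)
print(both)
```

The value after "color:" can be a named color or a hex code, which matches the color codes discussed in this appendix.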

Colors are controlled by codes that come after “color:” throughout the code. You can google colors you like and change these codes to tailor colors to your personal preference. Google even has a nice “color picker” that will let you do a sliding scale to find the code for the color you want, and you can match colors here: https://imagecolorpicker.com/en. I haven’t figured out how to change the color of the tab titles and chapter selector from the default R blue yet, but the rest is entirely malleable.

To do calculations in text, use inline R code; the result is rendered in place of the expression, like this: 4

If you’re familiar with LaTeX, wrap an expression in single dollar signs to typeset it inline, like this \(\frac{1}{2}\), or in double dollar signs to put the equation on its own line: \[\frac{1}{2}\]

To add scroll bars to a table, use this code:

mydata %>%
  kable(format = "html", col.names = colnames(mydata)) %>%
  kable_styling() %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
pred1 pred2 groupvar1 groupvar2 weight missvar_imputed missvar_missingind
1 1 A A 1 1.444444 1
2 2 A A 1 1.000000 0
3 3 A A 1 2.000000 0
4 4 A A 1 1.000000 0
1 5 A A 1 2.000000 0
2 1 A A 1 1.000000 0
3 2 A A 1 2.000000 0
4 3 A A 1 1.000000 0
1 4 A A 1 2.000000 0
2 5 A A 1 1.000000 0
3 1 A B 1 2.000000 0
4 2 A B 1 1.000000 0
1 3 A B 1 2.000000 0
2 4 A B 1 1.000000 0
3 5 A B 1 2.000000 0
4 1 A B 1 1.000000 0
1 2 A B 1 2.000000 0
2 3 A B 1 1.000000 0
3 4 A B 1 2.000000 0
4 5 A B 1 1.000000 0
1 1 B B 1 2.000000 0
2 2 B B 1 1.000000 0
3 3 B B 1 2.000000 0
4 4 B B 1 1.000000 0
1 5 B B 1 2.000000 0
2 1 B B 1 1.000000 0
3 2 B B 1 2.000000 0
4 3 B B 1 1.000000 0
1 4 B B 1 2.000000 0
2 5 B B 1 1.000000 0
3 1 B C 0 2.000000 0
4 2 B C 0 1.000000 0
1 3 B C 0 2.000000 0
2 4 B C 0 1.000000 0
3 5 B C 0 2.000000 0
4 1 B C 0 1.000000 0
1 2 B C 0 2.000000 0
2 3 B C 0 1.000000 0
3 4 B C 0 2.000000 0
4 5 B C 0 1.555556 1